This report examines the financial contributions made to presidential campaigns in the state of New York for 2016. The primary dataset was downloaded from datasource. I also created a list of cities in New York with their latitude and longitude, extracted from city_datasource.
The Financial Contribution dataset (after cleaning) contains 167,902 contributions and 23 variables, made up of:
| Variable | Name | Meaning / Use |
|---|---|---|
| cmte_id | Committee ID | A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission. |
| cand_id | Candidate ID | A 9-character alpha-numeric code assigned to a candidate by the Federal Election Commission. |
| cand_nm | Candidate Name | Recorded name of the candidate |
| contbr_nm | Contributor Name | Reported name of the contributor. |
| contbr_city | Contributor City | Reported city of the contributor |
| contbr_state | Contributor State | Reported state of the contributor |
| contbr_zip | Contributor Zip Code | Reported zip code of the contributor |
| contbr_employer | Contributor Employer | Reported employer of the contributor |
| contbr_occupation | Contributor Occupation | Reported occupation of the contributor |
| contb_receipt_amt | Contribution Receipt Amount | Reported contribution amount |
| contb_receipt_dt | Contribution Receipt Date | Reported contribution date |
| receipt_desc | Receipt Description | Additional information reported by the committee about a specific contribution |
| memo_cd | Memo Code | ‘X’ indicates the committee has provided additional text to describe a specific |
| memo_text | Memo Text | Additional information reported by the committee about a specific contribution |
| form_tp | Form Type | Indicates what schedule and line number the reporting committee reported a specific transaction |
| file_num | File Number | A unique number assigned to a report and all its associated transactions |
| tran_id | Transaction ID | A unique identifier for each transaction |
| election_tp | Election Type | This code indicates the election for which the contribution was made. EYYYY (election plus election year) |
To help with the analysis, I added some additional fields to the dataset:
| Variable | Meaning / Use |
|---|---|
| month | used for grouping data by month |
| week | used for grouping data by week |
| year | used for grouping data by year |
| latitude | stores the latitude based on the reported contribution city |
| longitude | stores the longitude based on the reported contribution city |
| employment_status | stores the employment status of each contributor based on the listed employer |
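The code that derived these fields is not shown in this report, so the following is only a minimal Python sketch of how the `month`, `week` and `year` fields could be built from `contb_receipt_dt`. The function name `add_date_fields`, the use of ISO week numbers, and the sample record are assumptions made for illustration.

```python
from datetime import date


def add_date_fields(record):
    """Return a copy of the record with year, month and week fields added."""
    receipt_date = record["contb_receipt_dt"]
    # Assumption: weeks are numbered using the ISO 8601 calendar.
    _, iso_week, _ = receipt_date.isocalendar()
    enriched = dict(record)
    enriched["year"] = receipt_date.year
    enriched["month"] = receipt_date.month
    enriched["week"] = iso_week
    return enriched


# Illustrative record only; the amount is not taken from the dataset.
sample = {
    "cand_id": "P00003392",
    "contb_receipt_dt": date(2016, 2, 29),
    "contb_receipt_amt": 25.0,
}
print(add_date_fields(sample))
```

Grouping by `(year, month)` or `(year, week)` rather than by `month` or `week` alone keeps periods from different years apart, which matters here because the receipt dates span more than one year.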
Note: To attach a latitude and longitude to each contribution, I needed to match the city names in the cities dataframe to those in the financial dataframe. Initially, some of the names did not match because of spelling differences. I reused a Python script from a previous project to find the closest matches and correct the differences in spelling.
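The script itself is not reproduced here. As a rough sketch of the general approach, approximate string matching from Python's standard library (`difflib.get_close_matches`) can map a misspelled city to the closest reference name. The helper `match_city`, the `cutoff` value of 0.8, and the sample spellings are all assumptions for illustration, not the original script.

```python
from difflib import get_close_matches


def match_city(reported, reference_cities, cutoff=0.8):
    """Return the closest reference city name, or None if nothing is close enough."""
    cleaned = reported.strip().upper()
    if cleaned in reference_cities:
        return cleaned  # exact match after normalising case and whitespace
    candidates = get_close_matches(cleaned, reference_cities, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


# Hypothetical reference list and reported spellings.
reference = ["NEW YORK", "BROOKLYN", "ALBANY", "SYRACUSE"]
for name in ["New York", "BROOKYLN", "Albnay", "SPRINGFIELD"]:
    print(name, "->", match_city(name, reference))
```

A higher cutoff reduces false matches but leaves more cities unmatched, so any name that returns `None` would still need a manual check.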
Once I had added the additional fields, I was able to start analysing the data. To help with this, I created a couple of groupings and summaries of the data.
The table below lists each candidate along with their candidate ID:
| cand_id | cand_nm |
|---|---|
| P60008059 | Bush, Jeb |
| P60005915 | Carson, Benjamin S. |
| P60008521 | Christie, Christopher J. |
| P00003392 | Clinton, Hillary Rodham |
| P60006111 | Cruz, Rafael Edward ‘Ted’ |
| P60007242 | Fiorina, Carly |
| P60007697 | Graham, Lindsey O. |
| P80003478 | Huckabee, Mike |
| P60008398 | Jindal, Bobby |
| P60003670 | Kasich, John R. |
| P60009685 | Lessig, Lawrence |
| P60007671 | O’Malley, Martin Joseph |
| P60007572 | Pataki, George E. |
| P40003576 | Paul, Rand |
| P20003281 | Perry, James R. (Rick) |
| P60006723 | Rubio, Marco |
| P60007168 | Sanders, Bernard |
| P20002721 | Santorum, Richard J. |
| P20003984 | Stein, Jill |
| P80001571 | Trump, Donald J. |
| P60006046 | Walker, Scott |
| P60008885 | Webb, James Henry Jr. |
The main features of this dataset are the candidate and the value of the contributions that they received. The figures below show the breakdown of the contributions.
## Total Value of Contributions: 46072566
## Total Number of Contributions: 167902
## Average Value of Contribution: 274.4015
## Maximum Contribution Value: 10800
## Minimum Contribution Value: 0.08
## Number of Candidates: 22
## Number of Contributors: 35955
The plot below shows that two candidates, P00003392 (Hillary Clinton) and P60006723 (Marco Rubio), received the highest numbers of contributions.
The first plot in the group above shows that the data is right skewed, with the majority of contributions being less than or equal to $500. To see the spread of the data more clearly, I applied a log transform to the transaction amount, which can be seen in the second plot of the group.
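To illustrate why the log transform helps with right-skewed amounts, here is a small Python sketch that counts contributions in equal-width bins on the log10 scale. The function `log10_bins`, the bin width of 0.5, and the sample amounts are assumptions for illustration. Only the minimum (0.08) and maximum (10800) are taken from the summary figures in this report.

```python
import math
from collections import Counter


def log10_bins(amounts, width=0.5):
    """Count amounts in equal-width bins on the log10 scale."""
    counts = Counter()
    for amount in amounts:
        if amount <= 0:
            continue  # log10 is undefined for zero or negative amounts
        lower = math.floor(math.log10(amount) / width) * width
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Illustrative amounts spanning the reported minimum and maximum.
amounts = [0.08, 5, 10, 25, 25, 50, 100, 250, 500, 2700, 10800]
for lower, n in log10_bins(amounts).items():
    # Convert the bin edges back to dollars for the label.
    print(f"${10 ** lower:>9.2f} to ${10 ** (lower + 0.5):>9.2f}: {'#' * n}")
```

Each bin covers the same ratio of values rather than the same dollar range, so the many small contributions are spread across several bins instead of being stacked into the first one.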
Other features in the dataset that would be useful to investigate are:
When I first grouped the data by the employer and occupation of the contributor, I could see that there were:
## Employers: 14735
## Occupations: 6732
This made it difficult to determine whether there were any distinct patterns, as some of these could be similar occupations with different titles, or the same employer recorded under different names. To look for patterns, I created an additional variable in the dataset for employment status, based on the listed employer.
| employment_status | contribution_count | contribution_value | avg_contribution | max_contribution |
|---|---|---|---|---|
| EMPLOYED | 92693 | 29817871 | 321.68418 | 10800 |
| SELF EMPLOYED | 25169 | 6525750 | 259.27727 | 10800 |
| NOT EMPLOYED | 20650 | 1731867 | 83.86767 | 5400 |
| RETIRED | 15542 | 2525894 | 162.52052 | 5400 |
| UNKNOWN | 13848 | 5471183 | 395.08835 | 10800 |
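The rules used to derive `employment_status` are not listed in this report. The Python sketch below shows one way such a classification and summary could be produced. The keyword rules in `employment_status`, the helper `summarise_by_status`, and the sample records (including the employer "ACME CORP") are hypothetical.

```python
from collections import defaultdict


def employment_status(employer):
    """Map a free-text employer value to a coarse status (assumed rules)."""
    value = (employer or "").strip().upper()
    if value in ("", "N/A", "NONE", "INFORMATION REQUESTED"):
        return "UNKNOWN"
    if "RETIRED" in value:
        return "RETIRED"
    if "SELF" in value:
        return "SELF EMPLOYED"
    if value in ("NOT EMPLOYED", "UNEMPLOYED", "STUDENT", "HOMEMAKER"):
        return "NOT EMPLOYED"
    return "EMPLOYED"


def summarise_by_status(records):
    """Return count, total, average and maximum contribution per status."""
    groups = defaultdict(list)
    for record in records:
        status = employment_status(record["contbr_employer"])
        groups[status].append(record["contb_receipt_amt"])
    return {
        status: {
            "contribution_count": len(amounts),
            "contribution_value": sum(amounts),
            "avg_contribution": sum(amounts) / len(amounts),
            "max_contribution": max(amounts),
        }
        for status, amounts in groups.items()
    }


# Hypothetical records for illustration only.
sample = [
    {"contbr_employer": "ACME CORP", "contb_receipt_amt": 100.0},
    {"contbr_employer": "Acme Corp", "contb_receipt_amt": 50.0},
    {"contbr_employer": "Self-Employed", "contb_receipt_amt": 250.0},
    {"contbr_employer": "RETIRED", "contb_receipt_amt": 25.0},
    {"contbr_employer": None, "contb_receipt_amt": 500.0},
]
for status, stats in summarise_by_status(sample).items():
    print(status, stats)
```

Collapsing thousands of employer spellings into a handful of statuses trades detail for comparability, which is the reason for grouping by status rather than by raw employer name.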
From this plot we can see that the bulk of the contributions came from contributors who were employed at the time of the contribution.
The plot below shows that the locations of contributions were generally spread out across the state of New York, with a couple of districts (Capital District and Central New York) that had a larger number of contributions. This probably reflects the population spread across the state and where businesses are generally located.
## [1] "Summary"
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
## "2013-10-11" "2015-10-19" "2016-01-13" "2015-12-08" "2016-02-11" "2016-02-29"
From this plot we can see that the number of contributions has increased over time. However, the data and summary show an outlier in 2013. I believe this could be the result of data entry errors, or of data being recorded later than the transaction occurred.
During the initial review of the data, I found a couple of outliers that affected the spread of the contribution values. These outliers included:
To help see the spread of contributions, I also applied either a SQRT or LOG10 transform to the scales where the data points were too close together to analyse and interpret. This showed the data in more detail. The risk with transformed scales is that the plots can be misinterpreted, so the scales need to be read carefully.
The first relationship I analysed was between the candidate and the total value of the contributions they received. The next plot shows the total value of contributions per candidate. It shows that whilst candidate P60006723 (Marco Rubio) had the highest number of contributions, the candidate with the highest value of contributions was P00003392 (Hillary Clinton).
In order to see the spread of the value of the contributions I applied a SQRT coordinate transformation on the y-axis, which can be seen on the plot below.
The plot below shows the spread of contribution values and the number of times each value was contributed. From this plot we can start to see that as the contribution value increases, the number of contributions of that value decreases.
The plots below show that the bulk of the total contributions were made by employed contributors, but the highest average contribution came from the unknown employment status. This status is made up of contributors who did not have an employer recorded against their contribution. The employed status also had the highest number of contributions, which is consistent with this group having the highest contribution total but not the highest average. The unknown group has the lowest number of contributions, so a few large contributions have a greater effect on its average.
When looking at the contribution values by location, we can see patterns similar to those in the count of contributions by location. The areas with the higher values correspond to the locations with the higher numbers of contributions.
The strongest relationship that I observed was between time and the number of contributions received. As the campaigning process progresses, the number of contributions increases. What I had expected, but didn’t see, was a matching increase in the total amount contributed in each time period.
During this part of the analysis, I decided to see how the timing or progression of the campaign affected the amount and number of contributions. First, I wanted to investigate whether some buckets of contribution amounts increased or decreased more than others. To determine this, I split the contribution amounts into quartiles. The plot below shows that whilst the value of contributions per month is increasing for each quartile, the greatest increase occurs in the lowest bucket ($0.08 to $25).
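As a sketch of this bucketing step, the Python below splits amounts into quartiles and totals the count and value per month and quartile. The function names, the choice of the "inclusive" quantile method, and the sample records are assumptions for illustration. The original analysis may have computed its quartile boundaries differently.

```python
from collections import defaultdict
from statistics import quantiles


def quartile_edges(amounts):
    """Return the three cut points that split the amounts into quartiles."""
    return quantiles(amounts, n=4, method="inclusive")


def bucket(amount, edges):
    """Return the quartile number (1 to 4) that an amount falls into."""
    for number, edge in enumerate(edges, start=1):
        if amount <= edge:
            return number
    return len(edges) + 1


def monthly_totals_by_quartile(records):
    """Return {(year, month, quartile): (count, total value)}."""
    edges = quartile_edges([r["contb_receipt_amt"] for r in records])
    totals = defaultdict(lambda: [0, 0.0])
    for r in records:
        key = (r["year"], r["month"], bucket(r["contb_receipt_amt"], edges))
        totals[key][0] += 1
        totals[key][1] += r["contb_receipt_amt"]
    return {key: tuple(value) for key, value in sorted(totals.items())}


# Hypothetical records for illustration only.
sample = [
    {"year": 2016, "month": 1, "contb_receipt_amt": 10.0},
    {"year": 2016, "month": 1, "contb_receipt_amt": 20.0},
    {"year": 2016, "month": 2, "contb_receipt_amt": 30.0},
    {"year": 2016, "month": 2, "contb_receipt_amt": 40.0},
    {"year": 2016, "month": 2, "contb_receipt_amt": 50.0},
]
for key, (count, value) in monthly_totals_by_quartile(sample).items():
    print(key, count, value)
```

Tracking both the count and the total value per bucket makes it possible to see when the number of contributions grows faster than their combined value.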
Next, I looked at the top 5 candidates to see how their total contribution amounts varied over time. The plots below show that most of the top 5 candidates have ups and downs, with all candidates dropping around the holiday season. The candidate with the highest and most consistent growth trend was P60007168 (Sanders, Bernard). The only candidate with the reverse trend was P60008059 (Bush, Jeb). The heatmaps further below show a similar story for all candidates.
I also wanted to see whether different candidates received more contributions from some areas than others. However, the plots below show that the top 5 candidates were receiving contributions from similar areas. This may be due to the population spread in New York.
After exploring the data, I was able to conclude that as the campaign progresses, the number of contributions increases for most of the candidates. However, this does not have a direct impact on the total value of the contributions for each month, because the greatest growth in the number of contributions occurs in the lowest quartile ($0.08 - $25.00).
I believe that analysing the location of contributions would be of most benefit when looking at the USA overall, as we would be able to draw a link between the number of contributions and the popularity of each candidate by state.